{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "adcf9054dbd97916",
   "metadata": {},
   "source": [
    "# Chapter 39: Policy-Based Methods"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6438df9d",
   "metadata": {},
   "source": [
    "## Exercise 39.1"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7ebdd547",
   "metadata": {},
   "source": [
    "&emsp;&emsp;Derive Equation (39.18): show that the baseline $B$ that minimizes $\\text{Var}(\\nabla_\\theta \\log \\pi (\\tau | \\theta) (G(\\tau) - B))$, i.e. the $B$ at which $\\nabla_B \\text{Var}(\\nabla_\\theta \\log \\pi (\\tau | \\theta) (G(\\tau) - B)) = 0$, satisfies\n",
    "$$\n",
    "B = \\frac{\\mathbb{E}_{\\tau \\sim \\pi(\\tau | \\theta)} (\\nabla_\\theta \\log \\pi (\\tau | \\theta))^2 G(\\tau) }{ \\mathbb{E}_{\\tau \\sim \\pi(\\tau | \\theta)} (\\nabla_\\theta \\log \\pi (\\tau | \\theta))^2 }\n",
    "$$\n",
    "Hint: use $\\text{Var}(x) = \\mathbb{E}[x^2] - \\mathbb{E}[x]^2$."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a24401c0",
   "metadata": {},
   "source": [
    "**Solution:**  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a26f336d",
   "metadata": {},
   "source": [
    "**Approach:**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3abf0c86",
   "metadata": {},
   "source": [
    "**Steps:**  "
   ]
  },
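  {
   "cell_type": "markdown",
   "id": "b7f2c1a0",
   "metadata": {},
   "source": [
    "A sketch of the derivation (an assumed approach, not necessarily the book's official solution), writing $g = \\nabla_\\theta \\log \\pi(\\tau | \\theta)$ and treating each component of $g$ separately:\n",
    "\n",
    "Since $\\mathbb{E}[g] = \\int \\nabla_\\theta \\pi(\\tau | \\theta) \\, \\mathrm{d}\\tau = \\nabla_\\theta \\int \\pi(\\tau | \\theta) \\, \\mathrm{d}\\tau = \\nabla_\\theta 1 = 0$, we have $\\mathbb{E}[g B] = B \\, \\mathbb{E}[g] = 0$, and the hint gives\n",
    "$$\n",
    "\\text{Var}(g (G(\\tau) - B)) = \\mathbb{E}[g^2 (G(\\tau) - B)^2] - (\\mathbb{E}[g \\, G(\\tau)])^2 .\n",
    "$$\n",
    "The second term does not depend on $B$, so minimizing the variance means setting the derivative of the first term with respect to $B$ to zero:\n",
    "$$\n",
    "-2 \\, \\mathbb{E}[g^2 (G(\\tau) - B)] = 0 \\quad \\Rightarrow \\quad B = \\frac{\\mathbb{E}[g^2 G(\\tau)]}{\\mathbb{E}[g^2]},\n",
    "$$\n",
    "which is Equation (39.18)."
   ]
  },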
  {
   "cell_type": "markdown",
   "id": "a3e029f5",
   "metadata": {},
   "source": [
    "## Exercise 39.2"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "11c951f9",
   "metadata": {},
   "source": [
    "&emsp;&emsp;Explain why the REINFORCE and actor-critic algorithms introduced in this chapter are on-policy learning algorithms rather than off-policy methods. Consider in what situations off-policy versions of REINFORCE and actor-critic would be needed."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "66e44a47",
   "metadata": {},
   "source": [
    "**Solution:**  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "64290b3a",
   "metadata": {},
   "source": [
    "**Approach:**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "86bf1cc9",
   "metadata": {},
   "source": [
    "**Steps:**  "
   ]
  },
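  {
   "cell_type": "markdown",
   "id": "c3d4e5f6",
   "metadata": {},
   "source": [
    "A note on the off-policy case (a sketch, not the book's official solution): both algorithms estimate an expectation over trajectories drawn from the current policy $\\pi(\\tau | \\theta)$ itself, which is what makes them on-policy. If data must instead come from a different behavior policy $\\beta$ (for example, logged experience from an old policy, or a separate exploratory policy whose samples one wants to reuse), the standard correction is importance sampling:\n",
    "$$\n",
    "\\nabla_\\theta J(\\theta) = \\mathbb{E}_{\\tau \\sim \\beta(\\tau)} \\left[ \\frac{\\pi(\\tau | \\theta)}{\\beta(\\tau)} \\, \\nabla_\\theta \\log \\pi(\\tau | \\theta) \\, G(\\tau) \\right]\n",
    "$$"
   ]
  },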
  {
   "cell_type": "markdown",
   "id": "1f956c2a",
   "metadata": {},
   "source": [
    "## Exercise 39.3"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "49cc435f",
   "metadata": {},
   "source": [
    "&emsp;&emsp;The actor-critic algorithm introduced in this chapter samples data with the Monte Carlo method and learns in mini-batch mode. Write out an algorithm that samples data with the temporal-difference method and learns in online mode."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8cec3e82",
   "metadata": {},
   "source": [
    "**Solution:**  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5c11f559",
   "metadata": {},
   "source": [
    "**Approach:**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aab7ccde",
   "metadata": {},
   "source": [
    "**Steps:**  "
   ]
  },
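  {
   "cell_type": "markdown",
   "id": "d5e6f7a8",
   "metadata": {},
   "source": [
    "One possible answer is the standard one-step (TD(0)) online actor-critic, sketched below; the step sizes $\\alpha_\\theta$, $\\alpha_w$ and the discount factor $\\gamma$ are assumed hyperparameters:\n",
    "\n",
    "1. Initialize policy parameters $\\theta$ and value-function weights $w$.\n",
    "2. For each episode, initialize the state $s$ and repeat until $s$ is terminal:\n",
    "   - Sample $a \\sim \\pi(\\cdot | s, \\theta)$, execute $a$, observe reward $r$ and next state $s'$.\n",
    "   - Compute the TD error $\\delta = r + \\gamma \\hat{v}(s', w) - \\hat{v}(s, w)$, taking $\\hat{v}(s', w) = 0$ if $s'$ is terminal.\n",
    "   - Update the critic: $w \\leftarrow w + \\alpha_w \\, \\delta \\, \\nabla_w \\hat{v}(s, w)$.\n",
    "   - Update the actor: $\\theta \\leftarrow \\theta + \\alpha_\\theta \\, \\delta \\, \\nabla_\\theta \\log \\pi(a | s, \\theta)$.\n",
    "   - $s \\leftarrow s'$.\n",
    "\n",
    "Each transition is used for exactly one update and then discarded, so learning is online rather than mini-batch."
   ]
  },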
  {
   "cell_type": "markdown",
   "id": "f8cfa0c9",
   "metadata": {},
   "source": [
    "## Exercise 39.4"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ed50e8e7",
   "metadata": {},
   "source": [
    "&emsp;&emsp;Why does the actor-critic algorithm directly define the gradient of the objective function, rather than the objective function itself? Give an explanation."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a42a9c8c",
   "metadata": {},
   "source": [
    "**Solution:**  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ffe9b30b",
   "metadata": {},
   "source": [
    "**Approach:**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8c99b546",
   "metadata": {},
   "source": [
    "**Steps:**  "
   ]
  },
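  {
   "cell_type": "markdown",
   "id": "e7f8a9b0",
   "metadata": {},
   "source": [
    "One way to see it (a hedged sketch of the reasoning): the policy gradient theorem expresses the gradient as an expectation, $\\mathbb{E}[\\nabla_\\theta \\log \\pi_\\theta(a|s) \\, q_{\\pi_\\theta}(s,a)]$, which can be estimated directly from samples even though the objective $J(\\theta)$ itself cannot be evaluated in closed form. Moreover, once the true value $q_{\\pi_\\theta}$ is replaced by a critic that is learned simultaneously, the resulting update direction is in general no longer the exact gradient of any single fixed objective, so the algorithm is naturally specified by its gradient (update rule) rather than by an objective function."
   ]
  },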
  {
   "cell_type": "markdown",
   "id": "e1dadc61",
   "metadata": {},
   "source": [
    "## Exercise 39.5"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "85614959",
   "metadata": {},
   "source": [
    "&emsp;&emsp;Prove that the policy gradient theorem still holds when the advantage function is used:\n",
    "$$\n",
    "\\nabla_\\theta J(\\theta) = \\mathbb{E}_{\\rho_{\\theta}(s)} \\left [ \\mathbb{E}_{\\pi_{\\theta}(a|s)} \\left [\\nabla_{\\theta} \\log \\pi_{\\theta}(a | s) A_{\\pi_{\\theta}} (s, a) \\right ] \\right ]\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "034423f9",
   "metadata": {},
   "source": [
    "**Solution:**  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4f40d483",
   "metadata": {},
   "source": [
    "**Approach:**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c1b188cd",
   "metadata": {},
   "source": [
    "**Steps:**  "
   ]
  },
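  {
   "cell_type": "markdown",
   "id": "f9a0b1c2",
   "metadata": {},
   "source": [
    "A sketch of the key step (an assumed approach): by the policy gradient theorem, $\\nabla_\\theta J(\\theta) = \\mathbb{E}_{\\rho_\\theta(s)} [ \\mathbb{E}_{\\pi_\\theta(a|s)} [\\nabla_\\theta \\log \\pi_\\theta(a|s) \\, q_{\\pi_\\theta}(s,a)] ]$. Since $A_{\\pi_\\theta}(s,a) = q_{\\pi_\\theta}(s,a) - v_{\\pi_\\theta}(s)$, it suffices to show that the subtracted term contributes nothing:\n",
    "$$\n",
    "\\mathbb{E}_{\\pi_\\theta(a|s)} [\\nabla_\\theta \\log \\pi_\\theta(a|s) \\, v_{\\pi_\\theta}(s)] = v_{\\pi_\\theta}(s) \\sum_a \\nabla_\\theta \\pi_\\theta(a|s) = v_{\\pi_\\theta}(s) \\, \\nabla_\\theta \\sum_a \\pi_\\theta(a|s) = v_{\\pi_\\theta}(s) \\, \\nabla_\\theta 1 = 0,\n",
    "$$\n",
    "so replacing $q_{\\pi_\\theta}$ with $A_{\\pi_\\theta}$ leaves the gradient unchanged."
   ]
  },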
  {
   "cell_type": "markdown",
   "id": "ea7c08c0",
   "metadata": {},
   "source": [
    "## Exercise 39.6"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f1d1d157",
   "metadata": {},
   "source": [
    "&emsp;&emsp;Consider why, in AlphaGo's learning, the policy network is trained through games between the new and old policy networks, while the value network is trained through self-play of the already-trained policy network."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5b9cd403",
   "metadata": {},
   "source": [
    "**Solution:**  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6b11a442",
   "metadata": {},
   "source": [
    "**Approach:**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0b2e1493",
   "metadata": {},
   "source": [
    "**Steps:**  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d2ad0dca8d50c14",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
